Nispero: a cloud-computing based Scala tool specially suited for bioinformatics data processing
Abstract
Nowadays it is widely accepted that bioinformatics data analysis is a real bottleneck in many research activities related to the life sciences. High-throughput technologies like Next Generation Sequencing (NGS) have completely reshaped the biology and bioinformatics landscape. Undoubtedly, NGS has allowed important progress in many life-sciences-related fields, but it has also presented interesting challenges in terms of computation capabilities and algorithms. Many kinds of tasks related to NGS data analysis, as well as other bioinformatics data analysis, can be computed in a parallel, independent way; taking maximum advantage of this can obviously help in alleviating the analysis bottleneck. Given the way NGS data is generated, scalability also plays an important role in its analysis. NGS data is not generated in a continuous fashion but in batches, so the computation needs can be dramatically different at different points in time. Cloud computing provides a perfect framework for systems with these two requirements: parallel and scalable. Besides, it allows adjusting the computation power on demand, and thus not being tied to (and paying for) a fixed compute infrastructure.

Nispero is a Scala library for declaring stateless computations and scaling them using cloud computing, in particular a combination of services from AWS (Amazon Web Services). Some highlights are:
• strongly typed configuration based on Scala code
• CRDT-like semantics (a nispero instance is essentially a morphism between idempotent commutative monoids)
• automatic deploy/undeploy

Nispero relies on the EC2 service (Elastic Compute Cloud) to carry out the computations, on the S3 service (Simple Storage Service) for data storage, and on SQS (Simple Queue Service) and SNS (Simple Notification Service) for communication between the different system components.
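The "morphism between idempotent commutative monoids" highlight can be made concrete with a small sketch. This is not Nispero's API: the names `ICMonoid`, `merge`, `lift`, `gcCount` and `process` are hypothetical and chosen only for illustration. It assumes sets under union as the monoid, and shows that lifting a per-item function to sets preserves the empty set and the union, so batches may be processed in any order, split arbitrarily, or even processed twice without changing the merged result.

```scala
object MonoidSketch {
  // An idempotent commutative monoid: `combine` is associative and commutative,
  // `empty` is its identity, and combine(a, a) == a.
  trait ICMonoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  object ICMonoid {
    // Sets under union satisfy all of the laws above.
    implicit def setUnion[T]: ICMonoid[Set[T]] = new ICMonoid[Set[T]] {
      def empty: Set[T] = Set.empty[T]
      def combine(x: Set[T], y: Set[T]): Set[T] = x union y
    }
  }

  // Merge any number of partial results with the monoid operation.
  def merge[A](parts: Seq[A])(implicit m: ICMonoid[A]): A =
    parts.foldLeft(m.empty)(m.combine)

  // Lifting a per-item function to sets gives a morphism:
  // lift(f)(empty) == empty and lift(f)(x union y) == lift(f)(x) union lift(f)(y).
  def lift[A, B](f: A => B): Set[A] => Set[B] = _.map(f)

  // Hypothetical stateless task: count G/C bases in a read.
  def gcCount(read: String): (String, Int) =
    (read, read.count(c => c == 'G' || c == 'C'))

  val process: Set[String] => Set[(String, Int)] = lift(gcCount)
}
```

With this shape, merging the outputs of two overlapping batches gives the same set as processing the merged input once, which is the property that makes retries and duplicate deliveries harmless.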
A Nispero system is composed of:
• a 'console' instance that tracks the status of the whole system at any moment, giving the user the opportunity to check at any point the current status of the computations, workers, etc.
• a 'manager' instance that is in charge of deploying and undeploying the group of workers
• a set of 'workers' that perform the computations/tasks in a parallel, independent way
• SQS queues for 'input', 'output' and 'error' messages
• S3 objects for 'input' and 'output' files

The lifecycle of a Nispero system is simple but robust. It starts with the launch of the 'console' and 'manager' instances; the 'manager' then takes the tasks from an S3 object, publishes them in an SQS queue and launches the workers. The workers take the messages with the tasks from the corresponding SQS queue (i.e. the 'input' queue) in an independent, parallel way. Once they have finished the computation, they put the results in S3 objects, publish a message in the 'output' SQS queue and delete the input message of the corresponding task from the 'input' queue.

Nispero is an open-source project released under the AGPLv3 license. The source code is available at https://github.com/ohnosequences/nispero

This project is funded in part by the ITN FP7 project INTERCROSSING (Grant 289974).

Proceedings IWBBIO 2014. Granada 7-9 April, 2014
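The task lifecycle described in the abstract (the manager publishes tasks, workers take them from the 'input' queue, store results, announce them on the 'output' queue and only then delete the input message) can be illustrated with a single-process simulation. This is a sketch under stated assumptions, not Nispero code and not the AWS SDK: in-memory queues stand in for SQS, a map stands in for S3, and the names `Task`, `InMemorySystem`, `publish` and `workOnce` are hypothetical. Routing failures to the 'error' queue is also an assumption, since the abstract does not say how that queue is used.

```scala
import scala.collection.mutable

object LifecycleSketch {
  final case class Task(id: String, payload: String)

  // In-memory stand-ins (assumptions): queues for SQS, a map for S3 objects.
  class InMemorySystem {
    val inputQueue  = mutable.Queue.empty[Task]
    val outputQueue = mutable.Queue.empty[String]
    val errorQueue  = mutable.Queue.empty[String]
    val storage     = mutable.Map.empty[String, String]

    // Manager step: publish every task to the 'input' queue.
    def publish(tasks: Seq[Task]): Unit =
      tasks.foreach(t => inputQueue += t)

    // Worker step: handle at most one task; returns false when no task is left.
    def workOnce(compute: String => String): Boolean =
      inputQueue.headOption match {
        case None => false
        case Some(task) =>
          try {
            // 1. store the result, 2. announce it on the 'output' queue
            storage(s"output/${task.id}") = compute(task.payload)
            outputQueue += task.id
          } catch {
            // Assumption: failed tasks are reported on the 'error' queue.
            case e: Exception => errorQueue += s"${task.id}: ${e.getMessage}"
          }
          // 3. delete the input message last, after the task has been handled
          inputQueue.dequeue()
          true
      }
  }
}
```

Deleting the input message last means a worker that dies mid-task leaves the message in the queue for another worker; combined with stateless, idempotent computations, reprocessing such a task does not change the final result.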
Publication date: 2014